fix: fix possible panic when 'SetNode' is called #1685
/assign @william-wang @k82cn @hzxuzhonghu
fca3c3a to bea046b
@@ -187,10 +187,14 @@ func (r *Resource) Add(rr *Resource) *Resource {
 	return r
 }

-//Sub subtracts two Resource objects.
+// Sub subtracts two Resource objects with assertion.
Although this modification is clear, I think the original function is also acceptable.
/lgtm
@william-wang @k82cn @hzxuzhonghu
/hold
Is there any panic stack info for this issue?
}

// Dry run, make sure all fields other than `State` are in the original state.
copy := ni.Clone()
why clone?
To make sure all fields are left untouched and in the original state, as I commented.
Can you point out where we mutate this node?
- The original code changes fields of nodeInfo throughout.
- Here, the clone I used is merely a dry run: if all these changes would make this node not ready, then we should not actually perform the operation and should just set the state of this node to not ready.
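For readers following the thread, the dry-run idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the trimmed-down `Resource` and `NodeInfo` types, the `Ready` flag, and the `apply` helper are all hypothetical stand-ins.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified stand-in for the scheduler's resource type.
type Resource struct {
	MilliCPU float64
	GPU      float64
}

// LessEqual reports whether r fits within rr in every dimension.
func (r Resource) LessEqual(rr Resource) bool {
	return r.MilliCPU <= rr.MilliCPU && r.GPU <= rr.GPU
}

// NodeInfo is a hypothetical, trimmed-down node record.
type NodeInfo struct {
	Name        string
	Ready       bool
	Allocatable Resource
	Used        Resource
}

// Clone returns a copy that can be mutated without touching the original.
func (ni *NodeInfo) Clone() *NodeInfo {
	c := *ni
	return &c
}

// apply mutates ni with the new allocatable resources and reports whether
// the result is consistent (used resources still fit on the node).
func (ni *NodeInfo) apply(alloc Resource) bool {
	ni.Allocatable = alloc
	return ni.Used.LessEqual(ni.Allocatable)
}

// SetNode first applies the update to a clone. If the clone ends up
// inconsistent, only the Ready flag of the real node is changed.
func (ni *NodeInfo) SetNode(alloc Resource) {
	dry := ni.Clone()
	if !dry.apply(alloc) {
		ni.Ready = false // every other field stays in its original state
		return
	}
	ni.apply(alloc)
	ni.Ready = true
}

func main() {
	ni := &NodeInfo{
		Name:        "n1",
		Ready:       true,
		Allocatable: Resource{MilliCPU: 4000, GPU: 2},
		Used:        Resource{MilliCPU: 1000, GPU: 2},
	}
	// The node now reports zero GPUs while two are still in use.
	ni.SetNode(Resource{MilliCPU: 4000, GPU: 0})
	fmt.Println(ni.Ready, ni.Allocatable.GPU)
}
```

The trade-off is one extra copy per update in exchange for never leaving the real node half-updated.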
@eggiter we need to fix func (r *Resource) Sub(rr *Resource) *Resource {
Amazing that there is a panic in the function; we should remove all of that.
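One possible shape for a `Sub` that does not panic is sketched below. It is only an illustration of the suggestion above, assuming a simplified two-field `Resource`; the `SubChecked` name and the `ErrInsufficient` value are hypothetical and are not part of the project's API.

```go
package main

import (
	"errors"
	"fmt"
)

// Resource is a hypothetical, simplified stand-in with two dimensions.
type Resource struct {
	MilliCPU float64
	Memory   float64
}

// ErrInsufficient is a hypothetical sentinel error for a failed subtraction.
var ErrInsufficient = errors.New("resource is not sufficient to do operation")

// LessEqual reports whether r fits within rr in every dimension.
func (r *Resource) LessEqual(rr *Resource) bool {
	return r.MilliCPU <= rr.MilliCPU && r.Memory <= rr.Memory
}

// SubChecked subtracts rr from r in place. If rr does not fit within r,
// it leaves r unchanged and returns an error instead of panicking.
func (r *Resource) SubChecked(rr *Resource) (*Resource, error) {
	if !rr.LessEqual(r) {
		return r, fmt.Errorf("%w: <%v> sub <%v>", ErrInsufficient, *r, *rr)
	}
	r.MilliCPU -= rr.MilliCPU
	r.Memory -= rr.Memory
	return r, nil
}

func main() {
	idle := &Resource{MilliCPU: 2000, Memory: 1024}
	// The caller decides what to do on failure, e.g. mark the node not ready.
	if _, err := idle.SubChecked(&Resource{MilliCPU: 4000, Memory: 512}); err != nil {
		fmt.Println("skipped:", err)
	}
}
```

Returning an error moves the decision to the caller, which could then mark the node not ready rather than crash the scheduler.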
AFAIK, we'll ignore not ready Nodes; is there any e2e case to trigger the panic?
Can you guys leave this fix to me? You can tell me your idea of how to resolve it if this PR cannot achieve this purpose.
@k82cn @hzxuzhonghu what's the current status?
We should ignore such nodes during the scheduling cycle.
It will be ignored by setting the status of the node to
I think such nodes can act as CPU nodes and continue to work if the device plugin doesn't work well.
Sorry to tell you, but the machine can lose its memory occasionally. It'd be better to set the status to
Signed-off-by: lvhaodong <lvhaodong@kuaishou.com>
bea046b to 83a58c3
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: k82cn
/lgtm
SetNode is called, which may happen when the GPUs of one machine are lost during the restart of nvidia-device-plugin;
TestNodeInfo_SetNode;